Natural Language Processing
Transcript
Wow, a big, big thank you to YouTube's AI algorithm. Thank you, AI. Hello world, it's Siraj, and computers are pretty good at learning from spreadsheets of data filled with numbers, but we humans communicate with words, not with numbers. The subfield of AI called natural language processing, or NLP, is focused on enabling computers to understand and communicate in human language. In this video, I'll cover how NLP has progressed over the years up until today, then explain how to use a bleeding-edge model called BERT that makes it incredibly easy for anyone, not me, for anyone, to build NLP apps right now. We'll specifically use BERT to learn from a data set of Amazon reviews to perform text classification and automatic summarization.
Language is a way to represent information, and we humans interpret this information from strings of text by assessing three different criteria: syntax, semantics, and pragmatics. Basically "meme review". Syntax describes the form of the language, usually specified by a grammar. Natural language is much more complicated than the formal languages used for programming; there are many rules of syntax to abide by, like "I before E, except after C", with exceptions. Semantics describes the meaning of words or sentences of the language, and pragmatics describes how the words relate to the world at large. It's about considering their context.
To understand the difference between these three criteria, take a look at the following four sentences. The first sentence is appropriate at the start of an article; it's syntactically, semantically, and pragmatically correct. The second sentence is syntactically and semantically correct, but pragmatically it sounds kind of wacky. The third sentence is syntactically correct but semantically incorrect, and the last sentence is incorrect on all three fronts: syntactically, semantically, and pragmatically.
Computer scientists have been creating automatic systems to attempt to do this for just about 70 years now, which makes it an incredibly young scientific discipline. We can trace the history of NLP back to 1950, when the prominent computer scientist Benedict Cumberbatch, I mean Alan Turing, published a landmark paper titled "Computing Machinery and Intelligence", which proposed what's now called the Turing test as a criterion of true intelligence. The question the Turing test poses is: can a computer program fool a human into thinking it's a human via conversation?
A few years later, Noam Chomsky, a prominent linguist, published a book titled "Syntactic Structures", which detailed a rule-based system of how to structure grammatically correct phrases, and this inspired many rule-based approaches to NLP. Generating a sentence was usually done by pulling syntactic information from a database. In fact, up until the 1980s, most NLP systems were based on a complex set of handwritten rules. But in the late 1980s, hair metal became an unfortunate reality. On a more related note, a revolution in NLP occurred when researchers started using machine learning algorithms for language processing instead of rule-based algorithms, mostly because the increase in available computational power allowed this strategy to outperform rule-based systems.
Some of the earliest-used learning algorithms, like decision trees, produced systems of hard if-then rules similar to existing handwritten rules. But as time progressed, researchers increasingly favored statistical models, which make probabilistic decisions as to what a word or sentence should sound like or mean instead of rule-based decisions. Nowadays, NLP systems like speech recognition software rely on such statistical models, which are more reliable, to predict which words were likely spoken by a user.
Siraj: "Alexa, show me a photograph."
Alexa: "Playing 'Photograph'."
No. A class of statistical models called deep neural networks has been the key driver in most of the recent NLP successes across a wide variety of tasks like machine translation, automatic summarization, and sentiment analysis. And free, open-source tools like PyTorch, Colab, and various text datasets have enabled individuals and teams from across the globe to create powerful applications that use NLP to solve real-world problems. For example, Klevu is a Finnish startup providing an instant site search solution for e-commerce stores. They're using text classification to provide relevant search results for shoppers and actionable insights for store owners. Another startup, called EnglishCentral, aims to make learning English much more fun by giving users instant feedback on their pronunciation using speech recognition techniques. Yummly is building a platform for recipe recommendations and search; they use NLP to understand, analyze, and connect users with the recipes they most enjoy.
Hold on, cowboy or cowgirl. Before you go build an NLP startup immediately, you need to understand one concept really well: BERT, which stands for Bidirectional Encoder Representations from Transformers.
I'll explain what each of those words means in a second. BERT is a fully trained language model that Google released just a few months ago, and it's been the most significant breakthrough in NLP thus far. A language model is able to learn the probability of word occurrence based on examples of text. Traditionally, language models are trained by using the previous n words to predict the next one, but BERT is a language model that was trained by using both the previous and next words when making predictions, hence the word bidirectional instead of unidirectional. BERT was used to establish a new state of the art in NLP tasks including question answering, sentiment analysis, and automatic summarization. All of these tasks involve a two-step process: train a deep language model on some text data, then give those representations to a task-specific model. For the first step of this process, the go-to technique for the past few years is called Word2Vec, and it creates word representations, also called word vectors.
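As a rough illustration of what "learning the probability of word occurrence" means, here is a minimal sketch that simply counts words in a tiny corpus. It is not how BERT is trained (BERT is a neural network, not a count table), and the corpus and function names are made up for this example. It only contrasts predicting a word from the previous word alone with predicting it from the words on both sides.

```python
from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ate the fish",
]

def train_left_to_right(sentences):
    """Count how often each word follows the previous word (a bigram model)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def train_both_sides(sentences):
    """Count how often each word appears between a given left and right neighbor."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for left, mid, right in zip(words, words[1:], words[2:]):
            counts[(left, right)][mid] += 1
    return counts

def probabilities(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {word: count / total for word, count in counter.items()}

left_model = train_left_to_right(corpus)
both_model = train_both_sides(corpus)

# Unidirectional: which words can follow "the"?
after_the = probabilities(left_model["the"])
# Bidirectional: which words fit between "the" and "sat"?
between_the_sat = probabilities(both_model[("the", "sat")])

print(after_the)
print(between_the_sat)
```

With only the left context, five different words are candidates after "the"; with context on both sides, only "cat" and "dog" remain. That narrowing is the intuition behind using both directions.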
Word2Vec maps each word in the training data set to a vector that represents some aspect of its meaning. So, for example, the word vector for "king" would include information about state and gender. These representations are generally trained on large unlabeled data, like a Wikipedia dump, then used to train models on labeled data for tasks like sentiment analysis. This allows models to leverage linguistic information learned from larger data sets. The problem with Word2Vec and similar word vector techniques was that they didn't take context into account. The word "bank", for example, would have a different meaning depending on the context it was used in, and these techniques have trouble capturing the meaning of combinations of words. These limitations motivated the use of recurrent networks as language models instead. Instead of training a model to map a single vector to each word, these techniques train a neural network to map a vector to each word based on the entire surrounding context.
Nowadays the transformer, a newer type of neural network, has eclipsed all variations of recurrent networks for language modeling. A transformer consists of an encoder network and a decoder network, so the phrase BERT means using a transformer network to create bidirectional encoder representations. These representations can then be fed into another model for some specific NLP task. Unlike recurrent networks, transformer networks like BERT don't use recurrent connections at all; they use attention over the word sequence instead. Attention is defined in neuroscience as the ability to selectively concentrate on one aspect of the environment while ignoring the rest. In deep learning, we mimic this concept through the use of attention mechanisms. One way of doing this is to encode an input sequence not into a single fixed vector, but instead to have the model learn how to generate a vector for each output time step by adding an additional set of weights that will later be optimized. So it doesn't just learn what to output; it learns how to selectively weigh parts of the input data to maximize the likelihood of the proper output.
BERT is composed of several attention blocks to prevent it from having ADHD, like I do. Each block transforms the input using matrix operations. If we input a sequence of n words, the encoder will output a sequence of n tensors. These tensors are used by the decoder to output a sequence of words. The architecture is optimized using gradient descent (link to how that works in the video description). The great thing about BERT is that it comes fully trained out of the box. It took Google four days using several Cloud TPUs to train it on several languages, so thank you, Google, I guess. All we need to do is fine-tune the final layer of BERT on our own training data set for whatever task we choose, and it will benefit from BERT's existing knowledge.
So let's look at our data set of Amazon reviews, which we'll use for two tasks: text classification and automatic summarization. By text classification, we're talking about classifying chunks of text, which could be anywhere from sentence size to an entire paragraph in length, as either a good review or a bad review. We'll first clone BERT into our environment, then we'll download the BERT model files. These are weight values that represent what it's learned from pre-training. Then we'll need to pre-process our data into a format that BERT expects. Column one will be a row ID. Column two is the label for the row as an int. Column three is a column of all the same letter; it's a throwaway column that we need to include because BERT expects it. WTF, right? Just roll with it. And the text examples, the ones we want to classify, will be in the last column. We can do all this easily with the pandas Python library. Once we've formatted our data, we can run training. Once it's finished training, we'll use it to predict on new text data by selecting the newly trained weights file as input, as well as some test review, and it will output a classification: either a good or bad review.
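Here is a minimal sketch of that pre-processing step with pandas, laying out the four columns as described: row ID, integer label, a throwaway letter, and the review text. The sample reviews, the column names, the choice of `"a"` as the throwaway letter, and writing tab-separated output without a header row are all assumptions made for this example, so check them against the BERT scripts you actually run.

```python
import pandas as pd

# Hypothetical raw reviews; in practice these would be loaded from the Amazon review data.
raw = pd.DataFrame({
    "review": ["Great product, works perfectly.", "Broke  after two days.\n"],
    "sentiment": ["good", "bad"],
})

def to_bert_format(df):
    """Build the four-column layout: row ID, int label, throwaway letter, text."""
    return pd.DataFrame({
        "id": list(range(len(df))),
        # Assumed label encoding: 0 = bad review, 1 = good review.
        "label": df["sentiment"].map({"bad": 0, "good": 1}).astype(int).tolist(),
        # Throwaway column: the same letter on every row.
        "alpha": ["a"] * len(df),
        # Collapse newlines and repeated spaces so each example stays on one line.
        "text": df["review"].str.replace(r"\s+", " ", regex=True).str.strip().tolist(),
    })

bert_df = to_bert_format(raw)

# Tab-separated, no index, no header (an assumption about the expected file layout).
tsv_text = bert_df.to_csv(sep="\t", index=False, header=False)
print(tsv_text)
```

To write a file instead of a string, pass a path as the first argument to `to_csv`, for example a hypothetical `train.tsv`.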
Now, if we want to perform automatic summarization, we can use the same learned embeddings. We'll first need to cluster them, then extract one sentence from each cluster. This is an unsupervised technique. These embeddings will be clustered in high-dimensional vector space, where the number of clusters is equal to the desired number of sentences we want in the summary. Each cluster of sentence embeddings can be interpreted as a set of semantically similar sentences whose meaning can be expressed by just one candidate sentence in the summary. This candidate sentence is selected to be the sentence whose vector representation is closest to the cluster center. We then order the candidate sentences to form a summary. This order is determined by the position of the sentences in their related clusters. This technique is considered extractive summarization. Amazing, right?
There are three things to remember from this video. Natural language processing is the study of computational techniques to help computers understand and communicate in human languages. Google's BERT model makes it easy for anyone to create an NLP application, greatly reducing the amount of training time, data, and compute necessary. And we can perform NLP tasks like text summarization and text classification using BERT. What do you want to do with NLP next? Let me know in the comment section, and please subscribe for more programming videos. For now, I've got to analyze some text, so thanks for watching.
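The clustering approach to extractive summarization described above can be sketched roughly as follows. This is a sketch under stated assumptions, not the exact pipeline from the video: the two-dimensional "embeddings" are hand-made stand-ins for real BERT sentence embeddings, k-means is one possible clustering algorithm (the video doesn't name one), all function names are hypothetical, and the final ordering simply follows each sentence's position in the original text.

```python
import numpy as np

def init_centers(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        dist_to_nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[dist_to_nearest.argmax()])
    return np.array(centers, dtype=float)

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=20):
    """Plain k-means: alternate between assigning points and recomputing centers."""
    centers = init_centers(points, k)
    for _ in range(n_iter):
        labels = assign(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assign(points, centers)

def summarize(sentences, embeddings, n_sentences):
    """Pick, per cluster, the sentence whose embedding is closest to the cluster center."""
    embeddings = np.asarray(embeddings, dtype=float)
    centers, labels = kmeans(embeddings, n_sentences)
    chosen = []
    for j in range(n_sentences):
        idx = np.where(labels == j)[0]
        if len(idx) == 0:
            continue
        dists = np.linalg.norm(embeddings[idx] - centers[j], axis=1)
        chosen.append(int(idx[dists.argmin()]))
    # Assumption: present the chosen sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]

sentences = [
    "The battery lasts all day.",
    "Battery life is excellent.",
    "I get a full day on one charge.",
    "Shipping was slow.",
    "Delivery took three weeks.",
    "The package arrived late.",
]
# Hand-made 2-D vectors standing in for real sentence embeddings.
embeddings = [
    [0.0, 0.0], [0.3, 0.0], [0.1, 0.3],
    [5.0, 5.0], [5.3, 5.0], [5.1, 5.3],
]

summary = summarize(sentences, embeddings, n_sentences=2)
print(summary)
```

With real embeddings, the vectors would have hundreds of dimensions, but the selection logic stays the same: one cluster per desired summary sentence, one representative sentence per cluster.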